1 Introduction

1) Why this topic

With the development of Internet and signal techniques, smartphones play an essential role in people’s lives. Ten years ago, people used mobile phones only to text and make calls; today smartphones are used for browsing the internet, sending email, video chatting with friends, and even paying bills. We therefore want to see how popular mobile phones are in different countries and how mobile internet requests are distributed over the course of a day. In addition, mobile operators can record the location of users when they make requests, though with some signal uncertainty, so it is also worth studying the geo-location information to look for spatial patterns in signal accuracy.

In this project, we focus on two specific areas of the world. The Middle East countries are close to each other but have very different national conditions. Russia has a vast territory, so the mobile signal may vary across the country. Studying these two areas may produce interesting results, so we used mobile request data from these countries to examine mobile usage and signal geo-location information in the Middle East and Russia.

We used data visualization and statistical methods to understand mobile popularity and signal accuracy in different countries and to look for the underlying reasons for the differences. In addition, we developed a mobile signal tracking site called Mobile Signal. It lists mobile usage records across time for three consecutive days in both the Middle East and Russia, together with the tools and information needed to understand the variation across countries or areas.

2) Research Questions

From the data, we try to answer four questions about the popularity of mobile:
a. Number of mobile users in different countries
b. Prevalence of mobile internet usage in different countries
c. Development of mobile industry in different countries
d. Usage volume of mobile across time in one day

With this data, we can also study further topics using the geo-location information:
a. Pattern of signal accuracy in different Mideast countries across time
b. The movement pattern of mobile internet request of some specific groups of people

3) How to find the data

The data came from the company where Zhirui Wang interned. It contains three days of data from December 2016 for Middle East countries and Russia. The original data is 17.6 GB, and we uploaded it to Google Drive. We also used world population data and GDP data from the World Bank.

2 Team members and distributed contributions

We have four group members: Zhirui Wang (zw2389), Xikai Chen (xc2358), Yaqing Wang (yw2902) and Chang Pan (cp2923). To start, Zhirui did all of the data cleaning and plotted bar charts to visualize the number of records and the accuracy. Yaqing then analyzed the bar charts to get a basic idea of the dataset. Next, Zhirui made spatial visualizations of the geo-location data in R to generate two maps: the number of records in the Middle East and in Russia. After these basic steps, Xikai created a Shiny app and embedded Zhirui’s plots and maps in it. He and Chang also used animations to visualize how the dots on the maps change over time. In addition, they optimized the app by adding filters that capture mobile usage at specific landmarks in Russia. Finally, for the report, Yaqing is responsible for the Introduction, the Team section and the Middle East part of the Main Analysis; Zhirui is responsible for the Analysis of Data Quality and the Russia part of the Main Analysis; and Chang and Xikai are responsible for the Executive Summary.

3 Data Quality

This is a data set of mobile signal data. It consists of two parts: the first part is from 22 countries in the Middle East during Dec 11-13, 2016, and the second part is from Russia during Dec 10-12, 2016. First let us look at the data quality of the Mideast data:

The original Mideast data is 1.93 Gigabytes; it is tab delimited and has no column names. The columns are: Timestamp, IP Address, User ID, Latitude, Longitude, Accuracy, Country. The timestamp is precise to the second, but we mostly analyze the data by hour, so we convert it into hours later. The IP Address does not give much information for our purposes, so we drop this variable to reduce memory consumption. The User ID is unique to each mobile phone; since the Mideast analysis focuses on the country level, the User ID is not informative either, and we drop it as well. For the same reason, the longitude and latitude are unnecessary as long as we have the country column, so we drop them too. The accuracy is a kind of ‘confidence interval’ that measures how far the recorded location might differ from the actual location. When we open Google Maps, there is a semi-transparent sky blue circle around our location, and the radius of that circle corresponds to the accuracy here. It might be better to call it ‘inaccuracy’, because the larger the number, the less confidence we have in the actual location, but the company that provided the data calls it ‘accuracy’, so we keep that name. The final column is the country of the mobile phone. It is an ISO two-letter abbreviation, so we web-scrape a code book to convert the abbreviation into the actual country name. The data cleaning function is as follows:

library(rvest)
codebook <- 'http://www.worldatlas.com/aatlas/ctycodes.htm' %>% 
  read_html %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table()

CleanData_Mideast <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
    "\t", escape_double = FALSE, col_names = FALSE,
    trim_ws = TRUE)
  colnames(x) <- c('Timestamp','IP Address','User ID','Latitude','Longitude','Accuracy','Country')
  x$COUNTRY <- codebook$COUNTRY[match(x$Country,codebook$`A2 (ISO)`)]
  x <- x %>% select(Timestamp,Accuracy,COUNTRY)
  write_csv(x,output)
}
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-11a.txt","C:/Users/wang_/Desktop/2016-12-11a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-12a.txt","C:/Users/wang_/Desktop/2016-12-12a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-13a.txt","C:/Users/wang_/Desktop/2016-12-13a_new.csv")
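
The lookup inside CleanData_Mideast relies on match(), which returns, for each two-letter code, the row of the code book that holds it. A minimal sketch on a made-up three-row code book (toy_codebook, its column A2 and toy_codes are illustrative names, not taken from the scraped table) shows the behaviour, including the NA produced when a code is absent from the code book:

```r
# Toy code book with a country-name column and a two-letter code column.
toy_codebook <- data.frame(
  COUNTRY = c("Cyprus", "Turkey", "Iraq"),
  A2 = c("CY", "TR", "IQ"),
  stringsAsFactors = FALSE
)

# Two-letter codes as they might appear in the raw Country column.
toy_codes <- c("TR", "CY", "TR", "XX")

# match() gives the row index in the code book for each code (NA if absent).
idx <- match(toy_codes, toy_codebook$A2)
country_names <- toy_codebook$COUNTRY[idx]
print(country_names)
```

Codes that are missing from the code book end up as NA, so it may be worth counting the NA values after the lookup before dropping the original column.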

Next let us look at the Russia data:

The original Russia data is 15.6 Gigabytes; it is also tab delimited and has no column names. The columns are Time Stamp, IP Address, User ID, Latitude, Longitude, Accuracy, Wi-Fi Networks Nearby, GSM Towers, Country. The Time Stamp, IP Address and Accuracy have the same meaning as in the Mideast data, and we process them in the same way. In this part of the analysis we focus on the individual level, so the User ID, Latitude and Longitude are crucial, and we do not drop them as we did in the previous part. The Wi-Fi Networks Nearby, GSM Towers and Country columns do not provide useful information for our analysis, so we drop them. The code for cleaning the data is as follows:

CleanData_Russia <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
                  "\t", escape_double = FALSE, col_names = FALSE,
                  trim_ws = TRUE)
  colnames(x) <- c('Time Stamp','IP Address','User ID','Latitude','Longitude','Accuracy','Wifi Networks Nearby','GSM Towers','Country')
  x <- x %>% select(`Time Stamp`,`User ID`,`Latitude`,`Longitude`,`Accuracy`)
  write_csv(x,output)
}
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-10b.txt","C:/Users/wang_/Desktop/2016-12-10b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-11b.txt","C:/Users/wang_/Desktop/2016-12-11b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-12b.txt","C:/Users/wang_/Desktop/2016-12-12b_new.csv")

The data are records from a telecom company, so by nature they have no missing values or outliers. However, some records that do not belong to a given day appear in that day’s file. During the analysis, we drop all of these rows.
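
Dropping out-of-window records amounts to a range filter on the timestamp. The sketch below uses made-up timestamps and an assumed window (window_start and window_end are illustrative, not the exact cutoffs used in the analysis); both sides of each comparison are POSIXct in UTC, so they are compared in the same units:

```r
# Hypothetical timestamps parsed in UTC; the first and last fall outside the window.
ts <- as.POSIXct(
  c("2016-12-10 23:59:58", "2016-12-11 00:00:03",
    "2016-12-13 23:59:59", "2016-12-14 00:00:01"),
  tz = "UTC"
)

# Assumed window: three full days, closed on the left and open on the right.
window_start <- as.POSIXct("2016-12-11 00:00:00", tz = "UTC")
window_end <- as.POSIXct("2016-12-14 00:00:00", tz = "UTC")

# Keep only the records that fall inside the window.
keep <- ts >= window_start & ts < window_end
kept <- ts[keep]
print(kept)
```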

We have integrated the first 1000 rows of data from each day for the Mideast and Russia into our Shiny app under the data tab. The code of the Shiny app is on GitHub.

4 Executive Summary

Mobile usage is an appealing topic. At the micro level, studying a person’s internet request records can tell us about that person’s living circle, lifestyle and so on. At the macro level, analyzing people’s mobile request data together with mobile signal accuracy reveals a lot about a country’s development, level of wealth, and even its infrastructure.

In this project, we focus on two specific areas of the world. The Middle East countries are close to each other but have very different national conditions. Russia has a vast territory, so the mobile signal may vary across the country. The study of these two areas yields some interesting results, and we present the most revealing findings in this short summary.

First, we took a look at the Middle East countries. Here is a plot showing the number of mobile usage records in different countries.

From the plot above we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East total GDP ranking, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that of a large country like Iran, which is unexpected.

Given the particular situation of Cyprus, however, this makes sense. As a small island country with great tourism resources, Cyprus has tourists from nearby countries all year round. People love to go there for short breaks. For Europeans in particular, Cyprus is just a short flight away and has a much lower cost of living than most European countries and other Mideast tourist destinations such as the United Arab Emirates. It is therefore quite possible that a large portion of the records comes from foreign tourists.

In addition, since Cyprus is a small island country without much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has a prosperous tourism sector. However, there are other possible explanations for this pattern. For example, the data might come from a single carrier, which could be based in Cyprus. The huge volume of records in Cyprus would then make more sense, since people in other countries may use other major carriers whose records do not appear in this dataset.

Besides the number of records, the other interesting feature of this data set is the accuracy. It is a kind of ‘confidence interval’ measuring how far the recorded location may differ from the actual one, so the larger the value, the less confidence we have in the location.

From the plot above, four countries have a mean accuracy over 2000 meters: Libya, Iraq, South Sudan and the Democratic Republic of Congo. All of them are in upheaval or have experienced great turbulence. From searching the internet, we found suggestions that smoke from warfare can also affect the mobile signal and hence the accuracy. Additionally, in unstable countries like these, base stations are easily damaged, and there is no spare money, people, resources or motivation to develop the mobile industry, whether mobile phones or base stations. So they are expected to have the largest accuracy values, that is, the least precise locations.

On the other hand, Cyprus again beats the other countries, with the best accuracy in the Middle East. Besides its stable political situation, the majority of the country is flat, which is beneficial for base stations. Furthermore, as a popular tourist destination, it has the motivation to build a better environment for internet users, for example high-quality infrastructure, in order to attract more tourists. In turn, tourists who probably come from richer countries are likely to use high-quality cell phones. All of these factors could lead to better accuracy.

Now we move to Russia. Here we took a closer look at the movement patterns of specific groups and found some interesting patterns among tourists in Moscow on the map.

Take this plot as an example: the blue dots represent the positions of a certain person and clearly reveal that person’s movement. Since the dots line up almost perfectly, it seems that the person was queuing for a tourist sight or perhaps just a restaurant. However, similar lined-up patterns do not appear across the map as one might expect. People do not seem to be lining up to buy tickets or to enter the sights, so this period was probably not a busy travel season in Moscow, and tourists could go wherever they wanted without waiting in line.

Another finding is that many of the clusters of internet requests are on bridges or at the waterfront. This may be because bridges and waterfronts are great places for photography: people take photos there with their mobile phones and then upload them to social media, which generates internet requests.

To sum up, in the Middle East, the larger or richer a country is, the more mobile usage records and the better accuracy it has, with one exception: Cyprus. As a major tourist destination, it attracts many foreign visitors, who contribute greatly to the country’s mobile usage records and in turn bring high-tech smartphones that motivate Cyprus to build better base stations. Also, as expected, countries in upheaval or at war have fewer records and larger accuracy values. As for Russia, based on the internet requests shown on the map, it was most likely the off-season for tourism in Moscow, and when people visit sights such as bridges and waterfronts, they tend to use their cell phones for internet access more often.

5 Main Analysis

Middle East

library(plotly)
library(tidyverse)
library(gganimate)
library(lubridate)
library(forcats)
library(biglm)
library(lmtest)
library(knitr)
library(leaflet)
X2016_12_11a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-11a_new.csv",progress=F)
X2016_12_12a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-12a_new.csv",progress=F)
X2016_12_13a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-13a_new.csv",progress=F)
x <- bind_rows(X2016_12_11a_new,X2016_12_12a_new,X2016_12_13a_new)

Number of records

First we took a look at the number of records in each Middle East country. For the purpose of comparison, we have two options: bar charts and pie charts. With more than 20 countries in total, however, it is difficult in a pie chart to identify the slice of a country with a small share, let alone compare it with an even smaller slice. We therefore settled on bar charts.

x_group_count <- x %>% 
  # Compare against a POSIXct cutoff so both sides are date-times
  filter(Timestamp>as.POSIXct('2016-12-12 00:00:00',tz='UTC')) %>% 
  group_by(COUNTRY) %>% 
  summarise(count=n()) %>% 
  arrange(count)
(x_group_count %>% 
  ggplot(aes(y=count,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records')+
  xlab('Country Name')+
  ggtitle('Number of Records by Country')) %>% 
  ggplotly

In order to draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by country and count the records for each country. We want the bar chart to be sorted by the number of records, so we arrange by the count variable and use the as_factor() function to keep this order when passing the data to ggplot. We also flip the coordinates so that the country names are easier to read.
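
The role of as_factor() can be illustrated without ggplot. For a character vector, as_factor() sets the factor levels in order of first appearance, which base R reproduces with factor(x, levels = unique(x)). A small sketch with made-up counts (the numbers are illustrative only):

```r
# Made-up counts for three countries.
counts <- data.frame(
  COUNTRY = c("Turkey", "Cyprus", "Iraq"),
  count = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Sort the rows by count, as arrange(count) does.
sorted <- counts[order(counts$count), ]

# Default factor(): levels are alphabetical, so the sorted order is lost.
alphabetical <- factor(sorted$COUNTRY)

# Levels in order of appearance: the sorted order is kept on the axis.
by_count <- factor(sorted$COUNTRY, levels = unique(sorted$COUNTRY))

print(levels(alphabetical))
print(levels(by_count))
```

Since ggplot orders a discrete axis by factor levels, the second version is the one that keeps the bars sorted by count.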

From the plot above, we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East total GDP ranking, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that of a large country like Iran.

Two obvious factors are associated with the number of records: national population and GDP. We explore them one by one, starting with population. Dividing the number of records by the national population gives the number of records relative to the population, which can serve as a measure of how prevalent mobile usage is in each country.

population <- "http://www.worldometers.info/world-population/population-by-country/" %>% 
  read_html %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table() %>% 
  .[,2:3]
colnames(population)[2] <- 'Population'
population$Population <- population$Population %>% gsub(',','',.) %>% as.numeric()
a <- match(x_group_count$COUNTRY,population$`Country (or dependency)`)
a[a %>% is.na %>% which] <- c(16,61,121,17)
x_group_count$Population <- population$Population[a]
(x_group_count %>% 
  mutate(percentage=100*count/Population) %>% 
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Population')+
  xlab('Country Name')+
  ggtitle('Number of Records to Population by Country')) %>% 
  ggplotly

In order to draw this graph, we first scrape world population information from the Internet and then match the table to our data by country name. We divide the number of records by the population and multiply by 100 to get a percentage, then draw the graph using the same technique as in the previous part. An alternative is to use the number of unique IDs as the numerator. However, we want the index to also reflect how many internet requests each user made, so we use the number of records.
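
The difference between the two numerators can be seen on a toy example. The user IDs and the population of 20 below are invented for illustration:

```r
# Six made-up requests from three distinct users.
records <- data.frame(
  user_id = c("u1", "u1", "u1", "u2", "u3", "u3"),
  stringsAsFactors = FALSE
)
population <- 20  # hypothetical national population

# Index used in the report: records per 100 inhabitants.
pct_records <- 100 * nrow(records) / population

# Alternative index: distinct users per 100 inhabitants.
pct_users <- 100 * length(unique(records$user_id)) / population

print(c(records = pct_records, users = pct_users))
```

The record-based index grows when the same user makes more requests, while the user-based index does not, so the two answer slightly different questions.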

The plot shows that Turkey, Saudi Arabia and the United Arab Emirates no longer rank highest. Instead, Cyprus has a far larger value than any other country: the number of internet requests recorded over the three days is nearly half of the country’s population. We think this is because Cyprus is a small island country with great tourism resources. People from nearby countries, especially Europeans, love to go there for a short break, so it is quite possible that a large share of the records comes from foreign tourists. Other than Cyprus, richer countries generally seem to have more records per person than poorer ones.

To check this finding, we examine the ratio of the number of records to total GDP, which can serve as a measure of the development of the mobile industry relative to the whole economy in each country. We take the total GDP data from the World Bank data set and divide the number of records by total GDP.

url <- 'http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv'
temp <- tempfile()
download.file(url, temp, mode="wb")
unzip(temp, "API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv")
totalgdp <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv",skip = 4)[,c(1,2,60)]
unlink(temp)
totalgdp$Country <- codebook$COUNTRY[match(totalgdp$`Country Code`,codebook$`A3 (UN)`)]
a <- match(x_group_count$COUNTRY,totalgdp$Country)
x_group_count$gdp <- totalgdp$`2015`[a]
(x_group_count %>% 
  mutate(percentage=count/gdp) %>% 
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Total GDP')+
  xlab('Country Name')+
  ggtitle('Number of Records to Total GDP by Country')) %>% 
  ggplotly

In order to draw this graph, we first download the GDP data automatically from World Bank Open Data, match it to our data in the same way as above, and then divide the number of records by total GDP. An alternative is to use GDP per capita, which would give an index of mobile phone usage similar to an Engel coefficient. However, we want to focus on country-level analysis here, so we stay with the relative development of the mobile industry.

We can see that Cyprus again surpasses the other countries by a huge margin. Since it is a small island country without much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has a prosperous tourism sector. However, there are other possible explanations for this pattern. For example, the data might come from a single carrier, which could be based in Cyprus. The huge volume of records in Cyprus would then make more sense, since people in other countries may use other major carriers whose records do not appear in this dataset.

After comparing the data across countries, we now compare them across time.

First, we want to see how the number of records varies across time, so we plot two animated interactive graphs that evolve as time goes by.

This is a screenshot of the interactive map (in the Shiny app, it is the Number of Records in Mideast map tab):

We use leaflet to draw this choropleth map. We download geojson data of the world country polygons, use geojsonio::geojson_read() to read the data in as a SpatialPolygonsDataFrame object, and then attach our count data to this object. Because the number of records varies a lot across countries, we have to use an uneven color scale to visualize the values. We add a slider to the graph to select the hour of the data, and we animate the graph with a “play/pause” button. We use the addPolygons() function to draw the SpatialPolygonsDataFrame onto the map. We set many parameters in this function to improve the appearance of the graph: the polygons are semi-transparent so that the country names in the background remain readable, the country boundaries are white dashed lines to give a hand-drawn look, and a description of each country appears when the cursor is over its polygon.
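
The need for an uneven color scale can be illustrated in base R with cut(). The counts and break points below are made up for illustration; in leaflet, the same kind of break vector can be passed to colorBin() through its bins argument.

```r
# Made-up record counts spanning several orders of magnitude.
counts <- c(120, 950, 4800, 26000, 410000, 1450000)

# Hand-picked uneven breaks (illustrative values only).
breaks <- c(0, 1000, 10000, 100000, 1000000, Inf)
uneven <- cut(counts, breaks = breaks, right = FALSE, dig.lab = 8)

# Five equal-width bins over the same range, for comparison.
even <- cut(counts, breaks = 5)

print(table(uneven))
print(table(even))
```

With equal-width bins, four of the six values share the lowest bin and would receive the same color, while the uneven breaks separate them.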

This is a screenshot of the interactive bar chart (in the Shiny app, it is the Number of Records in Mideast barchart tab):

This graph is very similar to the static bar chart of the number of records, except that we add the hour as another dimension in order to animate it. We add a slider to the graph to select the hour, use the filter() function to select the data for that hour, and then render the plot.

From the plot above, we can see that the number of records hits its lowest point at dawn, increases as the day goes on, and reaches its peak at midnight, which makes sense. At dawn, almost everyone is asleep. When people get up and begin their day, they start to use their phones for all kinds of things. During the daytime, however, people have to work, study or run errands, so the usage peak occurs only after they have finished work or schoolwork, had dinner with family, and put their children to bed.

Accuracy

Besides the number of records, the other interesting feature in this data set is the accuracy. It can be seen as the ‘confidence interval’ of the base stations in different countries, which can also be affected by the mobile device itself.

x_group_accuracy <- x %>% 
  # Compare against a POSIXct cutoff so both sides are date-times
  filter(Timestamp>as.POSIXct('2016-12-12 00:00:00',tz='UTC')) %>% 
  group_by(COUNTRY) %>% 
  summarise(mean_Accuracy=mean(Accuracy),st_accuracy=sd(Accuracy))
(x_group_accuracy %>% 
  arrange(mean_Accuracy) %>% 
  ggplot(aes(y=mean_Accuracy,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Mean Accuracy')+
  xlab('Country Name')+
  ggtitle('Mean Accuracy by Country')) %>% 
  ggplotly

In order to draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by country and calculate the mean and standard deviation of the accuracy for each country. We also sort the bar chart by the mean accuracy.

From the plot above, four countries have a mean accuracy over 2000 meters: Libya, Iraq, South Sudan and the Democratic Republic of Congo. All of them are in upheaval or have experienced great turbulence. Smoke from warfare may affect the mobile signal and hence the accuracy. Additionally, in unstable countries like these, there is no spare money, people, resources or motivation to develop the mobile industry, whether mobile phones or base stations. So they are expected to have the largest accuracy values, that is, the least precise locations. On the other hand, Cyprus again beats the other countries, with the best accuracy in the Middle East. Besides its stable political situation, the majority of the country is flat, which is beneficial for base stations. Furthermore, as a popular tourist destination, it has the motivation to build a better environment for internet users in order to attract more tourists.

We can also visualize how the mean accuracy changes over time.

This is a screenshot of the interactive bar chart (in the Shiny app, it is the Mean Accuracy in Mideast barchart tab):

This graph is very similar to the static bar chart of the mean accuracy, except that we add the hour as another dimension in order to animate it. We add a slider to the graph to select the hour, use the filter() function to select the data for that hour, and then render the plot.

There does not seem to be a clear pattern in how the mean accuracy evolves over time, but we can see that South Sudan and Libya vary a lot across time. So we plot the standard deviation of the accuracy across countries and across time to get a clearer view.

(x_group_accuracy %>% 
  arrange(st_accuracy) %>% 
  ggplot(aes(y=st_accuracy,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Standard Deviation of Accuracy')+
  xlab('Country Name')+
  ggtitle('Standard Deviation of Accuracy by Country')) %>% 
  ggplotly

We use the data frame generated in the previous part as input for this graph. We order the countries by the standard deviation of the accuracy and draw the bar chart.

South Sudan has the highest standard deviation, followed by Libya and Syria, which probably has much to do with their unstable political situations. However, the rest of the countries have similarly high standard deviations. So it is possible that Middle East countries in general lack sophisticated technology and well-developed infrastructure, which leads to poor accuracy with a large standard deviation.

(x %>% 
  # Compare against a POSIXct cutoff so both sides are date-times
  filter(Timestamp>as.POSIXct('2016-12-12 00:00:00',tz='UTC')) %>% 
  mutate(Hour=hour(Timestamp)) %>% 
  group_by(Hour) %>% 
  summarise(st_accuracy=sd(Accuracy)) %>% 
  mutate(hour=Hour %>% as.character) %>% 
  ggplot(aes(y=st_accuracy,x=hour %>% as_factor))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Standard Deviation of Accuracy')+
  xlab('Hour')+
  ggtitle('Standard Deviation of Accuracy across Time')) %>% 
  ggplotly

In order to draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by hour and calculate the standard deviation of the accuracy for each hour. We also have to use as_factor() to keep the hours in numerical order; otherwise, once the hour is converted into a character, ggplot places 10, 11 and 12 after 1 rather than after 9.

From the earlier analysis of the number of records across time, we naturally assumed that the accuracy would follow a similar pattern. However, based on the plot above, there is no notable difference in the standard deviation of the accuracy across time. This suggests that the accuracy depends mainly on the base stations themselves rather than on how many people are using their phones.
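
The same steps can be sketched in base R on a handful of made-up records: extract the hour from the timestamp, compute the standard deviation per hour, and see why hours stored as plain characters sort in the wrong order. The timestamps and accuracy values are invented for illustration.

```r
# Made-up records: two in hour 9 and two in hour 10 (UTC).
ts <- as.POSIXct(
  c("2016-12-12 09:15:00", "2016-12-12 09:45:00",
    "2016-12-12 10:05:00", "2016-12-12 10:55:00"),
  tz = "UTC"
)
accuracy <- c(100, 300, 50, 50)

# Hour of day, the base R counterpart of lubridate::hour().
hour_of_day <- as.POSIXlt(ts)$hour

# Standard deviation of the accuracy within each hour.
sd_by_hour <- tapply(accuracy, hour_of_day, sd)
print(sd_by_hour)

# Character sorting puts "10" and "11" before "2".
sorted_as_text <- sort(as.character(c(1, 2, 9, 10, 11)))
print(sorted_as_text)
```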

Russia

Number of Records and Accuracy

First we draw a bar chart of the number of records per hour.

X2016_12_10b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-10b_new.csv",progress=F)
X2016_12_11b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-11b_new.csv",progress=F)
X2016_12_12b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-12b_new.csv",progress=F)
x <- bind_rows(X2016_12_10b_new,X2016_12_11b_new,X2016_12_12b_new)
(x %>% 
  mutate(Hour=hour(`Time Stamp`)) %>% 
  group_by(Hour) %>% 
  summarise(count=n()) %>% 
  ggplot(aes(y=count,x=Hour %>% as.character))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records')+
  xlab('Hour')+
  ggtitle('Number of Records across Time')) %>% 
  ggplotly

The plotting process is almost the same as its counterpart for the Mideast. The only difference is that there are no records from outside the given days, so we do not need to filter them first.

Internet usage behavior in Russia is very different from that in the Mideast. The number of records peaks in the afternoon, decreases as time goes by and reaches its lowest point at night, which is the opposite of the Mideast. One possible explanation is that Russians tend to start their day early and finish their jobs around noon. After an early lunch, they may have afternoon tea with colleagues or friends, where they chat, laugh, take selfies and post on social media such as Facebook or Instagram. Russians may also be fairly traditional, so after work or school they tend to go home and spend quality time with their families, for example watching a movie together, reading books to their children, or taking a dance lesson with their spouse.

We can also see this relationship when we plot the data on the map. This is a screenshot of the animated map (in the Shiny app, it is the Accuracy in Russia map tab):

In this graph we have billions of points to visualize. When we try to visualize them with interactive graphics, R crashes halfway, so we have to plot the map with static graphics and combine the maps over time to make an animation. We use the ggmap package to draw the static maps. The range of the Accuracy is huge, so we apply a log transformation to the Accuracy in order to get a clearer color scale. It takes several minutes to draw the map for a single one of the ten hours, so we use parallel computing in R to plot the maps in parallel and save time. We save each map as a png and load each hour’s map in Shiny using an hour slider.
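
The effect of the log transformation on the color scale can be checked numerically. In the sketch below the accuracy values are made up, and the base-10 logarithm is an assumption, since the base used for the maps is not stated. Each value is mapped to a position between 0 and 1 on the color scale, first on the raw scale and then on the log scale:

```r
# Made-up accuracy values in meters, spanning four orders of magnitude.
accuracy <- c(5, 20, 150, 1200, 48000)

# Position of each value on a linear color scale (0 = lowest, 1 = highest).
linear_pos <- (accuracy - min(accuracy)) / (max(accuracy) - min(accuracy))

# Position on a log10 color scale (the base is an assumption).
log_acc <- log10(accuracy)
log_pos <- (log_acc - min(log_acc)) / (max(log_acc) - min(log_acc))

print(round(linear_pos, 3))
print(round(log_pos, 3))
```

On the linear scale, four of the five values sit within the lowest 5% of the scale and would be nearly indistinguishable in color, whereas the log scale spreads them out.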

As shown in the map, the mobile phone users are all clustered in the southwest part of the country, and there is not a single request in most of Siberia. This is what we expected, since the majority of Russians live in the southwestern region, while Siberia is sparsely populated because of its harsh weather.

We can also see the pattern discussed earlier clearly on the map: the number of dots peaks in the afternoon and decreases as time goes by.

In addition, we can see from the map that in the south east part of the country there are many red dots, while in the north east part most of the dots are light yellow. To quantify this observation, we fit a linear regression of accuracy on latitude and longitude:
\[ Accuracy = \beta_{0} + \beta_{1}Latitude + \beta_{2}Longitude + \beta_{3}Latitude\times Longitude + \epsilon \]

reg <- biglm(Accuracy~Latitude+Longitude+Latitude*Longitude,data=x)
reg %>% coeftest %>% as.table %>% kable

Term                 Estimate     Std. Error   z value      Pr(>|z|)
(Intercept)          93.4328276   1.3290097    70.302595    0
Latitude             -0.3636014   0.0244292    -14.883893   0
Longitude            -0.3164779   0.0179775    -17.604094   0
Latitude:Longitude   0.0033318    0.0003352    9.940549     0

In this linear regression we use biglm() instead of lm() because we have billions of data points; the regular lm() function would generate a huge object of up to 9 Gigabytes and use up all the memory. biglm() avoids this problem by updating the fit incrementally as it processes the data, so it does not need to keep the full model matrix in memory.
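
The idea of fitting a regression in bounded memory can be sketched in base R. Note that this is not the algorithm biglm uses (biglm relies on an incremental QR decomposition); the sketch swaps in a simpler technique, accumulating the normal-equation sums chunk by chunk, only to show that the coefficients can be obtained without holding the whole model matrix at once. The data are simulated and the chunk size of 100 is arbitrary.

```r
set.seed(1)

# Simulated data (illustrative only, not the Russia data).
n <- 1000
dat <- data.frame(
  Latitude = runif(n, 41, 81),
  Longitude = runif(n, 20, 160)
)
dat$Accuracy <- 90 - 0.4 * dat$Latitude - 0.3 * dat$Longitude + rnorm(n)

# Running sums of X'X and X'y; only these small matrices stay in memory.
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)

# Process the rows in chunks of 100.
chunks <- split(seq_len(n), ceiling(seq_len(n) / 100))
for (rows in chunks) {
  X <- cbind(1, dat$Latitude[rows], dat$Longitude[rows])
  y <- dat$Accuracy[rows]
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, y)
}

# Solve the accumulated normal equations and compare with lm().
beta_chunked <- drop(solve(xtx, xty))
beta_lm <- unname(coef(lm(Accuracy ~ Latitude + Longitude, data = dat)))
print(rbind(beta_chunked, beta_lm))
```

The two sets of coefficients agree up to numerical error. A QR-based update, as used by biglm, is generally better conditioned than the normal equations for real data.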

The regression coefficients agree with what we saw on the map. The main-effect estimates imply that accuracy decreases by 0.36 per degree of latitude and by 0.32 per degree of longitude, and both are statistically significant. The interaction term is significant as well. Its coefficient (0.0033) is small per degree, but it is multiplied by coordinates that reach tens of degrees, so it partly offsets the main effects rather than being strictly negligible. Since latitude varies by about 40 degrees and longitude by about 140 degrees across Russia, these slopes translate into substantial differences in accuracy across the country.
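
Because the model includes an interaction, the slope with respect to latitude depends on longitude. As a worked check using the estimates in the table above (the two longitudes are illustrative values, not taken from the data): \[ \frac{\partial Accuracy}{\partial Latitude} = \beta_{1} + \beta_{3}Longitude \approx -0.3636 + 0.0033318\times Longitude \]

At a longitude of 40 degrees this gives about -0.23 per degree of latitude, and at 100 degrees about -0.03, so the size of the latitude effect changes noticeably from west to east.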

Internet Request Motion in Moscow

In this part, we visualize the movement of people who make mobile internet requests in Moscow. We select the User IDs recorded at five famous landmarks in Moscow, whom we treat as tourists, and visualize their mobile internet requests hour by hour over one day. Our data only contains hours 13-22, so we can only visualize these ten hours. Note that there may be hundreds of internet requests while the number of unique IDs is only around 20; this number is shown in the upper-right panel. The hour and landmark selectors are also in this panel. Clicking the icon of a data point shows its description, which consists of the User ID and the time stamp.
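
One possible way to pick out the User IDs seen near a landmark is a simple distance filter. The sketch below is an assumption about how such a selection could work, not the code used in the app: the landmark coordinates are approximate, the 200 m radius is arbitrary, and the toy request log and the helper `haversine_m()` are hypothetical.

```r
# Great-circle distance in metres (haversine formula), vectorised over points.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Approximate landmark coordinates (assumed, for illustration only).
landmark <- c(lat = 55.7525, lon = 37.6231)  # roughly St. Basil's Cathedral
radius_m <- 200                               # hypothetical cut-off

# Toy request log with hypothetical user IDs.
req <- data.frame(
  UserID    = c("u1", "u1", "u2", "u3", "u3"),
  Hour      = c(13, 15, 14, 13, 16),
  Latitude  = c(55.7526, 55.9726, 55.7000, 55.7520, 55.5915),
  Longitude = c(37.6230, 37.4146, 37.5000, 37.6235, 37.2615),
  stringsAsFactors = FALSE
)

dist_m <- haversine_m(req$Latitude, req$Longitude,
                      landmark[["lat"]], landmark[["lon"]])
visitors <- unique(req$UserID[dist_m <= radius_m])

# Keep every request made by those users, wherever it happened.
visitor_requests <- req[req$UserID %in% visitors, ]
```

Selecting by user first and then keeping all of that user's requests is what makes it possible to follow the same people to other places, such as the airports.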

This is a screenshot of the interactive map (in the Shiny app, it is the Internet Request in Moscow map tab):

To draw this map and read it clearly, we need a base map that provides English place names on Russian territory. Unfortunately, none of the default map providers in leaflet shows English place names for foreign countries, so we use Mapbox as a custom map provider. We want to see how the internet requests cluster, so we use marker clusters when drawing individual points on the map; each cluster shows the number of points it contains. When we zoom in, the clusters spread out and finally show the individual points.
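
A minimal sketch of a clustered marker map with a custom tile source is shown below. It assumes the leaflet package is installed; the request points are invented, and the OpenStreetMap tile URL is only a stand-in for the custom provider's URL template and access token, which are not reproduced here.

```r
library(leaflet)

# Hypothetical request points near central Moscow.
req <- data.frame(
  UserID    = c("u1", "u1", "u2", "u3"),
  Timestamp = c("2016-12-10 13:05:11", "2016-12-10 13:40:02",
                "2016-12-10 14:12:45", "2016-12-10 15:30:00"),
  Latitude  = c(55.7526, 55.7539, 55.7520, 55.7601),
  Longitude = c(37.6230, 37.6208, 37.6235, 37.6186),
  stringsAsFactors = FALSE
)

# Stand-in tile source: replace with the custom provider's URL template
# (including its access token) to get English labels.
tile_url <- "https://tile.openstreetmap.org/{z}/{x}/{y}.png"

m <- leaflet(req) %>%
  addTiles(urlTemplate = tile_url,
           attribution = "&copy; OpenStreetMap contributors") %>%
  addMarkers(
    lng = ~Longitude, lat = ~Latitude,
    # The popup shows the User ID and time stamp when a marker is clicked.
    popup = ~paste0("User ID: ", UserID, "<br>Time: ", Timestamp),
    # Nearby markers collapse into a numbered cluster until zoomed in.
    clusterOptions = markerClusterOptions()
  )
```

Printing `m` in an interactive session or returning it from `renderLeaflet()` displays the map.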

First let us look at the general pattern of internet requests from the people who have been to these five landmarks. The number of unique IDs and the number of requests both peak in the afternoon and bottom out at night, with no record in hour 22. This may be because tourists go back to their hotels at night and connect to the hotel WiFi, so they no longer need to make internet requests over the mobile network.

The next thing worth noticing is the two airports on the map: Vnukovo International Airport at the bottom-left and Sheremetyevo International Airport at the upper-left. Almost all of the internet requests at the airports fall between hours 13 and 17. In each of these hours the IDs seen at an airport are different, which suggests that the airport is receiving new tourists every hour. There is no record at the airports in the evening, perhaps because people who arrive in Moscow at night do not go to the landmarks but go straight to their hotels.

Most of the internet records are clustered in the center of Moscow, around the Kremlin, Red Square and so on. This suggests that whatever else people visit in Moscow, the landmarks in the city center are must-visit places for most tourists.

Also, we notice that many of the clusters of internet requests are on a bridge or at the waterfront. This may be because bridges and waterfronts are great places for photography: people take photos there with their mobile phones and then upload them to social media, which requires internet requests.

Now we take a look at the airports our tourists come from. It is a little surprising that the people visiting St. Basil’s Cathedral all come through Vnukovo International Airport, since Sheremetyevo International Airport handles about twice as many passengers. A similar situation holds for tourists who visited the GUM Department Store (Glavny Universalny Magazin), with only 7 records at Sheremetyevo International Airport. Another notable finding is that there is no internet usage record at any airport from people who visited Patriarch’s Pond. It is possible that they are all local residents who just go to the pond for a walk rather than tourists. After all, it is neither a historical building unique to Moscow nor a must-see in Russia. As for the Bolshoi Theatre and the Moscow Metro, there is no noticeable difference between the numbers of records at the two airports.

6 Conclusion

This project has some limitations. First, the data is mobile internet request data from one carrier in the Mideast and Russia. We do not know whether the users of this carrier are representative of all the people in each country, as there might be systematic bias in the sample; for example, poorer people may tend to use cheaper carriers. We would need more information about this carrier and the characteristics of its users to discuss this further.

Second, we use the number of request records in our analysis, which might hide some important information in the data set. In the analysis of population and GDP of the Mideast countries, we mentioned some alternative approaches that use unique IDs instead. For the internet request motion in Moscow, we could also use the unique IDs to visualize the movement of people rather than of internet requests. However, a person may make many requests in many places during one hour, so it is hard to decide which point to choose. Besides, the data is only available when people make internet requests, so it is hard to track a person’s location when he or she does not make any. Still, these alternatives are possible directions for future work.

Third, there are a few bugs in the Shiny apps. When we play the animated interactive map, the background of the map keeps flashing because Shiny generates a new map under the hood at each hour step. We tried the leaflet JavaScript plugin Leaflet.timeline, but it turned out to have even more bugs than Shiny, so we finally chose Shiny’s own slider. Also, the sliders of the two maps are hard to drag, and how hard they are differs between group members’ computers. We think this might be a bug in Shiny itself.

As for lessons learned, we could have done better at exploring the deeper meaning and patterns of the data. Instead, we spent many hours debugging Shiny and JavaScript and looking for more beautiful ways to visualize our maps in Shiny. This was very time-consuming and had the least marginal benefit, and it kept us from analyzing the data more deeply and discovering possible reasons or stories behind these findings.


  1. zw2389@columbia.edu

  2. xc2358@columbia.edu

  3. yw2902@columbia.edu

  4. cp2923@columbia.edu

---
title: "Analysis of Mobile Signal Data in Mideast and Russia"
author: Group Members:Zhirui Wang^[zw2389@columbia.edu], Xikai Chen^[xc2358@columbia.edu],
  Yaqing Wang^[yw2902@columbia.edu], Chang Pan^[cp2923@columbia.edu]
date: "New York, `r Sys.Date()`"
output:
  html_notebook:
    toc: yes
---

#1 Introduction
##1) Why this topic
With the development of Internet and signal technology, smartphones play an essential role in people’s lives. Ten years ago, people only used mobile phones to text and make calls, but nowadays smartphones can do all kinds of things, such as surfing the internet, sending emails, video chatting with friends or even paying bills. So, we want to see how popular mobile phones are in different countries and what the patterns of mobile internet requests are over the course of a day. Also, mobile operators can get the locations of users when they make requests, but with some signal uncertainty, so it is also worth studying the geo-location information to see the spatial patterns of signal accuracy.

In this project, we focus on some specific areas of the world. The Middle East countries are all close to each other, yet they have very different national conditions. Russia has a vast territory, so the mobile signal may vary across the country. Studying these two areas may produce very interesting results, so we used mobile request data from these countries to examine mobile usage and signal geo-location information in the Middle East and Russia.

We used data visualization and statistical methods to understand mobile popularity and signal accuracy in different countries, and looked for the underlying reasons for the differences. In addition, we developed a mobile signal tracking site called Mobile Signal. It lists mobile usage records across time for three consecutive days in both the Middle East and Russia, together with the tools and information needed to understand the variations across countries and areas.


##2) Research Questions
From the data, we try to figure out four questions in popularity of mobile:  
a. Number of mobile users in different countries  
b. Prevalence of mobile internet usage in different countries  
c. Development of mobile industry in different countries  
d. Usage volume of mobile across time in one day  

Also with this data, we can study more interesting topics using geo-location information:  
a. Pattern of signal accuracy in different Mideast countries across time  
b. The movement pattern of mobile internet request of some specific groups of people  

##3) How to find the data
The data comes from the company where Zhirui Wang interned. It covers three days in 2016 for Middle East countries and Russia. The original data is 17.6 GB. We uploaded it to [Google Drive](https://drive.google.com/open?id=0Bxhza798zLVYU0dETV9wYTNVbTQ).  We also used population data from [World Population](http://www.worldometers.info/world-population/population-by-country/) and GDP data from the [World Bank](http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv).

#2 Team members and distributed contributions
We have four group members: Zhirui Wang (zw2389), Xikai Chen (xc2358), Yaqing Wang (yw2902) and Chang Pan (cp2923). To start, Zhirui did all the data cleaning and plotted bar charts to visualize the number of records and the accuracy. Yaqing then analyzed the bar charts to get a basic idea of the data set. Next, Zhirui made spatial visualizations of the geo-location data in R to generate two maps of the number of records, one for the Middle East and one for Russia. After these basic steps, Xikai created a Shiny app and embedded Zhirui’s plots and maps in it. He and Chang also used animations to visualize how the dots on the maps change over time. In addition, they added filters to the app to capture mobile usage at specific landmarks in Russia. Finally, for the report, Yaqing is responsible for the Introduction, the Team section and the Middle East part of the Main Analysis; Zhirui is responsible for the Data Quality section and the Russia part of the Main Analysis; and Chang and Xikai are responsible for the Executive Summary.

#3 Data Quality
This is a data set of mobile signal data. It consists of two parts: the first part is from 22 countries in the Middle East during Dec 10-13 2016, and the second part is from Russia during Dec 10-12. First let us look at the data quality of the Mideast data:

```{r, message=FALSE, warning=FALSE}
library(tidyverse)
X2016_12_11a <- read_delim("C:/Users/wang_/Desktop/2016-12-11a.txt","\t", escape_double = FALSE,col_names = FALSE,trim_ws = TRUE,progress = F) %>% head
X2016_12_11a
```
The original Mideast data is 1.93 gigabytes; it has no column names and is tab-delimited. The columns are: **Timestamp, IP Address, User ID, Latitude, Longitude, Accuracy, Country**. The timestamp is precise to the second, but since we mostly analyze the data by hour, we will convert it into hours later. The IP address does not carry much information for our purposes, so we drop it to reduce memory consumption. The User ID is unique to every mobile phone; because the Mideast analysis is mainly at the country level, the User ID is not informative either, and we drop it as well. For the same reason, longitude and latitude are not needed once we have the country column, so we also drop them here. The accuracy is a measure of the ‘confidence interval’ of the location, that is, how far the recorded location might differ from the actual location. When we open Google Maps, there is a semi-transparent sky-blue circle around our location, and the radius of that circle is the accuracy here. It might be better to call it ‘inaccuracy’, because the larger the number, the less confident we are about the actual location, but the company that supplied the data calls it ‘accuracy’, so we keep that name. The final column is the country of the mobile phone, given as a two-letter ISO abbreviation, so we web-scrape a code book to convert the abbreviation into the actual country name. The data cleaning function is as follows:
```{r, eval=F, message=FALSE, warning=FALSE}
library(rvest)
codebook <- 'http://www.worldatlas.com/aatlas/ctycodes.htm' %>% 
  read_html %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table()

CleanData_Mideast <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
    "\t", escape_double = FALSE, col_names = FALSE,
    trim_ws = TRUE)
  colnames(x) <- c('Timestamp','IP Address','User ID','Latitude','Longitude','Accuracy','Country')
  x$COUNTRY <- codebook$COUNTRY[match(x$Country,codebook$`A2 (ISO)`)]
  x <- x %>% select(Timestamp,Accuracy,COUNTRY)
  write_csv(x,output)
}
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-11a.txt","C:/Users/wang_/Desktop/2016-12-11a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-12a.txt","C:/Users/wang_/Desktop/2016-12-12a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-13a.txt","C:/Users/wang_/Desktop/2016-12-13a_new.csv")
```
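
The two transformations described in this section, looking up the country name with match() and reducing the time stamp to an hour, can be tried out on toy data. The miniature code book, the records and the deliberately unknown code "ZZ" below are hypothetical and only serve to show what happens to unmatched codes.

```r
# Tiny stand-in for the scraped code book (real ISO 3166-1 alpha-2 codes).
codebook_toy <- data.frame(
  COUNTRY = c("Turkey", "Cyprus", "Iran"),
  A2      = c("TR", "CY", "IR"),
  stringsAsFactors = FALSE
)

# Toy records; "ZZ" is deliberately absent from the code book.
rec <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-11 00:15:42", "2016-12-11 09:03:10",
                           "2016-12-11 23:59:59"), tz = "UTC"),
  Country   = c("CY", "TR", "ZZ"),
  stringsAsFactors = FALSE
)

# match() returns the row position of each code, or NA when it is missing,
# so unmatched codes silently become NA country names.
rec$COUNTRY <- codebook_toy$COUNTRY[match(rec$Country, codebook_toy$A2)]

# Reduce the second-level time stamp to the hour of day (0-23).
rec$Hour <- as.integer(format(rec$Timestamp, "%H"))

# Codes that need attention because the code book does not cover them.
unmatched <- unique(rec$Country[is.na(rec$COUNTRY)])
```

Checking `unmatched` after the lookup is a cheap way to notice codes that would otherwise disappear into an `NA` group.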

Then let us look at the data of Russia:
```{r, message=FALSE, warning=FALSE}
X2016_12_10b <- read_delim("C:/Users/wang_/Desktop/2016-12-10b.txt","\t", escape_double = FALSE,col_names = FALSE, trim_ws = TRUE,progress = F) %>% head
X2016_12_10b
```

The original Russia data is 15.6 gigabytes; it is also tab-delimited and has no column names. The columns are **Time Stamp**, **IP Address**, **User ID**, **Latitude**, **Longitude**, **Accuracy**, **Wi-Fi Networks Nearby**, **GSM Towers**, **Country**. The Time Stamp, IP Address and Accuracy have the same meaning as in the Mideast data, and we process them in the same way. In this part of the analysis we focus on the individual level, so the User ID, Latitude and Longitude are crucial and we do not drop them as we did before. The Wi-Fi Networks Nearby, GSM Towers and Country columns do not provide useful information for our analysis, so we drop them. The code for cleaning the data is as follows:
```{r,eval=F}
CleanData_Russia <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
                  "\t", escape_double = FALSE, col_names = FALSE,
                  trim_ws = TRUE)
  colnames(x) <- c('Time Stamp','IP Address','User ID','Latitude','Longitude','Accuracy','Wifi Networks Nearby','GSM Towers','Country')
  x <- x %>% select(`Time Stamp`,`User ID`,`Latitude`,`Longitude`,`Accuracy`)
  write_csv(x,output)
}
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-10b.txt","C:/Users/wang_/Desktop/2016-12-10b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-11b.txt","C:/Users/wang_/Desktop/2016-12-11b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-12b.txt","C:/Users/wang_/Desktop/2016-12-12b_new.csv")
```

The data consists of records from a telecom company, so by nature it has no missing values or outliers. However, each day’s file contains some records that do not belong to that day. We drop these rows as part of the data cleaning in the analysis.
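
Dropping the rows that fall outside a given day can be sketched as below. The records are invented and the time stamps are assumed to be in UTC; the point of the sketch is that both sides of the comparison are POSIXct values, so the cut-offs are compared in seconds on the same scale.

```r
# Toy records for the file dated 2016-12-11 (hypothetical values); the first
# and last rows belong to neighbouring days.
rec <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-10 23:59:58", "2016-12-11 00:00:00",
                           "2016-12-11 17:30:00", "2016-12-12 00:00:01"),
                         tz = "UTC"),
  Accuracy = c(120, 35, 2400, 60)
)

day_start <- as.POSIXct("2016-12-11", tz = "UTC")
day_end   <- day_start + 24 * 60 * 60   # start of the next day

# Half-open interval [day_start, day_end): keeps midnight of the day itself,
# drops midnight of the following day.
keep <- rec$Timestamp >= day_start & rec$Timestamp < day_end
rec_clean <- rec[keep, ]
```

The same pair of bounds can be widened to cover several consecutive days by moving `day_end`.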

We have integrated the first 1000 rows of data from each day in the Mideast and Russia into our [shiny app](https://zhirui.shinyapps.io/mobile_signal/) under the **data** tab. The code of the Shiny app is on [GitHub](https://github.com/zhiruiwang/Mobile-Signal-Quest).

#4 Executive Summary
Mobile usage is an appealing topic. At the micro level, studying a person’s internet request records can tell us about that person’s social circle, lifestyle and so on. At the macro level, analyzing people’s mobile request data and mobile signal accuracy reveals a lot about a country’s development, level of wealth and even its infrastructure.

In this project, we focus on some specific areas of the world. The Middle East countries are all close to each other, yet they have very different national conditions. Russia has a vast territory, so the mobile signal may vary across the country. The study of these two areas yields some interesting results, and we present the most revealing findings in this short summary.

First, we took a look at the Middle East countries. Here is a plot showing the number of mobile usage records in each country.

![](./GIF/summary1.jpeg)

From the plot above we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East in total GDP, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that in a large country like Iran, which is totally unexpected.

However, given the special situation of Cyprus, this makes sense. As a small island country with great tourism resources, Cyprus receives tourists from nearby countries all year round. People love to go there for short breaks. For Europeans in particular, Cyprus is just a short flight away and has much lower living expenses than most European countries and other Mideast tourist destinations such as the United Arab Emirates. It is therefore quite possible that a large portion of the records comes from foreign tourists.

In addition, since Cyprus is a small island country that does not have much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has a prosperous tourism sector. However, there are other possible explanations for this pattern. For example, the data itself might come from a single carrier, which could be based in Cyprus. In that case the huge volume of records in Cyprus would make even more sense, since people in other countries may use other major carriers whose records are invisible in this data set.

Besides the number of records, the other interesting feature in this data set is the accuracy. It measures how far the recorded location may differ from the actual one, much like a ‘confidence interval’. So the larger the value is, the less confidence we have in the location.

![](./GIF/summary2.jpeg)

From the plot above, four countries have a mean accuracy over 2000 meters: Libya, Iraq, South Sudan and the Democratic Republic of Congo. All of them are in upheaval or have experienced severe turbulence. From searching online, we also found claims that smoke from the fighting can affect the mobile signal and hence the accuracy. Additionally, in unstable countries like these, base stations are easily damaged, and there is no spare money, people, resources or motivation to develop the mobile industry, whether mobile phones or base stations. So they are expected to have the highest accuracy values, that is, the least precise locations.

On the other hand, Cyprus again beats the other countries and has the best accuracy in the Middle East. Besides its stable political situation, the majority of the country is plain, which is beneficial for base stations. Furthermore, as a popular tourist destination, it has the motivation to build a better environment for internet users, for example high-quality infrastructure, in order to attract more tourists. In turn, tourists, who probably come from richer countries, are likely to use high-quality cell phones. All of these could lead to better accuracy.

Now we move to Russia. Here we took a closer look at the movement of specific groups and found some interesting patterns among tourists in Moscow on the map.

![](./GIF/summary3.jpg)

Take this plot as an example. The blue dots represent the positions of a certain person and clearly reveal how that person moved. Since they line up almost perfectly, it seems that the person was queuing at a tourist sight or perhaps just a restaurant. However, similar line patterns do not appear elsewhere on the map, as we would expect if people had to queue to buy tickets or to enter the sights. We therefore suspect that this period was not a busy travel season in Moscow and that tourists could visit wherever they wanted without waiting in line.

Another finding is that many of the clusters of internet requests are on a bridge or at the waterfront. This may be because bridges and waterfronts are great places for photography: people take photos there with their mobile phones and then upload them to social media, which requires internet requests.

To sum up, in the Middle East, the larger or the richer a country is, the more mobile usage records and the better accuracy it has, with one exception: Cyprus. As a major tourist destination, it attracts many foreign visitors, who contribute greatly to the country’s mobile usage records and in turn bring high-end smartphones that motivate Cyprus to build better base stations. Also, as expected, countries in upheaval or at war have fewer records and larger accuracy values. As for Russia, based on the internet requests shown on the map, we conclude that it was very likely the off-season for tourism in Moscow, and that when people visit scenic spots such as bridges and waterfronts, they tend to use their cell phones for internet access more often.

#5 Main Analysis
##Middle East
```{r, message=FALSE, warning=FALSE}
library(plotly)
library(tidyverse)
library(gganimate)
library(lubridate)
library(forcats)
library(biglm)
library(lmtest)
library(knitr)
library(leaflet)
X2016_12_11a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-11a_new.csv",progress=F)
X2016_12_12a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-12a_new.csv",progress=F)
X2016_12_13a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-13a_new.csv",progress=F)
x <- bind_rows(X2016_12_11a_new,X2016_12_12a_new,X2016_12_13a_new)
```
###Number of records
First we took a look at the number of records in each Middle East country. For the purpose of comparison, we had two options: bar charts and pie charts. However, with more than 20 countries in total, a pie chart makes it difficult to identify the slice of a country with a small share, or to compare it with an even smaller slice. We therefore settled on bar charts.
```{r,fig.width=8}
x_group_count <- x %>% 
  filter(Timestamp>as.Date('2016-12-12 00:00:00 UTC')) %>% 
  group_by(COUNTRY) %>% 
  summarise(count=n()) %>% 
  arrange(count)
(x_group_count %>% 
  ggplot(aes(y=count,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records')+
  xlab('Country Name')+
  ggtitle('Number of Records by Country')) %>% 
  ggplotly
```

To draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by country and count the number of records for each country. We want the bar chart to be sorted by the number of records, so we arrange by the count variable and use the as_factor() function to preserve this order when passing the data to ggplot. We also flip the coordinates so that the country names are easier to read.
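
Why the factor conversion matters for the bar order can be seen on a few hypothetical counts. The sketch uses base R only; `factor(x, levels = unique(x))` follows the same order-of-first-appearance rule that `forcats::as_factor()` applies to a character vector.

```r
# Hypothetical counts, already sorted ascending as arrange(count) would do.
counts <- data.frame(
  COUNTRY = c("Oman", "Iran", "Cyprus", "Turkey"),
  count   = c(1200, 5000, 21000, 48000),
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted alphabetically, so the sort is lost
# and the bars would be drawn in alphabetical order.
alphabetical <- levels(factor(counts$COUNTRY))

# Levels in order of first appearance: the sorted row order is preserved
# and the bars follow the counts.
by_count <- levels(factor(counts$COUNTRY, levels = unique(counts$COUNTRY)))
```

A plotting function that maps a factor to an axis places the categories in level order, which is why the second version keeps the bars sorted.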

From the plot above, we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East in total GDP, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that in a large country like Iran.

There are two obvious factors associated with the number of records: national population and GDP. We explore them one by one, starting with population. Dividing the number of records by the national population gives the number of records relative to the population, which can serve as a measure of how prevalent mobile usage is in each country.

```{r, message=FALSE, warning=FALSE,fig.width=8}
population <- "http://www.worldometers.info/world-population/population-by-country/" %>% 
  read_html %>% 
  html_nodes('table') %>% 
  .[[1]] %>% 
  html_table() %>% 
  .[,2:3]
colnames(population)[2] <- 'Population'
population$Population <- population$Population %>% gsub(',','',.) %>% as.numeric()
a <- match(x_group_count$COUNTRY,population$`Country (or dependency)`)
a[a %>% is.na %>% which] <- c(16,61,121,17)
x_group_count$Population <- population$Population[a]
(x_group_count %>% 
  mutate(percentage=100*count/Population) %>% 
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Population')+
  xlab('Country Name')+
  ggtitle('Number of Records to Population by Country')) %>% 
  ggplotly
```

To draw this graph, we first scrape world population figures from the Internet and match the table to our data by country name. We divide the number of records by the population and multiply by 100 to get a percentage, then draw the graph using the same technique as in the previous part. An alternative is to use the number of unique IDs as the numerator. However, we want the index to also reflect how many internet requests each user made, so we choose to use the number of records.
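
The matching and percentage steps can be sketched on made-up numbers. The figures, the spelling difference and the `aliases` lookup below are all hypothetical; the sketch shows one way to handle country names that are spelled differently in the two tables by naming them explicitly rather than by row position.

```r
# Hypothetical counts and population figures (made-up numbers).
counts <- data.frame(
  COUNTRY = c("Cyprus", "Turkey", "Syrian Arab Republic"),
  count   = c(500, 2000, 90),
  stringsAsFactors = FALSE
)
population <- data.frame(
  Country    = c("Turkey", "Syria", "Cyprus"),
  Population = c(80000, 18000, 1000),
  stringsAsFactors = FALSE
)

# Explicit spelling map from the code-book name to the population-table name.
aliases <- c("Syrian Arab Republic" = "Syria")

lookup_name <- counts$COUNTRY
renamed <- lookup_name %in% names(aliases)
lookup_name[renamed] <- aliases[lookup_name[renamed]]

idx <- match(lookup_name, population$Country)
stopifnot(!anyNA(idx))  # fail loudly if a country is still unmatched

counts$Population <- population$Population[idx]
counts$percentage <- 100 * counts$count / counts$Population
```

A name-based map keeps working if the scraped table gains or loses rows, whereas fixed row positions would then point at the wrong countries.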

The plot shows that Turkey, Saudi Arabia and the United Arab Emirates no longer rank the highest. Instead, Cyprus has a far larger value than the other countries: the number of requests recorded in three days is nearly half of the country’s population. We think this is because Cyprus is a small island country with great tourism resources. People from nearby countries, especially Europeans, love to go there for short breaks, so it is quite possible that a large share of the records comes from foreign tourists. Other than Cyprus, richer countries generally seem to have more records per person than poorer ones.

To confirm this finding, we examine the ratio of the number of records to total GDP, which can serve as a measure of how developed the mobile industry is relative to the whole economy in each country. We take the total GDP data from the World Bank data set and divide the number of records by total GDP.

```{r, message=FALSE, warning=FALSE}
url <- 'http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv'
temp <- tempfile()
download.file(url, temp, mode="wb")
unzip(temp, "API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv")
totalgdp <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv",skip = 4)[,c(1,2,60)]
unlink(temp)
```

```{r, message=FALSE, warning=FALSE,fig.width=8}
totalgdp$Country <- codebook$COUNTRY[match(totalgdp$`Country Code`,codebook$`A3 (UN)`)]
a <- match(x_group_count$COUNTRY,totalgdp$Country)
x_group_count$gdp <- totalgdp$`2015`[a]
(x_group_count %>% 
  mutate(percentage=count/gdp) %>% 
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Total GDP')+
  xlab('Country Name')+
  ggtitle('Number of Records to Total GDP by Country')) %>% 
  ggplotly
```

To draw this graph, we first download the GDP data automatically from World Bank Open Data, match it to our data in the same way as above, and then divide the number of records by total GDP. An alternative is to use GDP per capita, which would give an Engel-coefficient-like index of mobile phone usage. However, we want to focus on country-level analysis here, so we stay with the relative development of the mobile industry.

We can see that Cyprus again surpasses the other countries by a huge margin. Since it is a small island country that does not have much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has a prosperous tourism sector. However, there are other possible explanations for this pattern. For example, the data itself might come from a single carrier, which could be based in Cyprus. In that case the huge volume of records in Cyprus would make even more sense, since people in other countries may use other major carriers whose records are invisible in this data set.

After comparing the data across countries, we now can compare them across time.

First, we want to see how the number of records varies over time, so we plot two animated interactive graphs that evolve as time goes by.

This is a screenshot of the interactive map (in the [shiny app](https://zhirui.shinyapps.io/mobile_signal/), it is the **Number of Records in Mideast** map tab):

![](./GIF/Mideast Map.gif)


We use leaflet to draw this choropleth map. We download GeoJSON data of the world’s country polygons, use geojsonio::geojson_read() to read the data in as a SpatialPolygonsDataFrame object, and then join our count data to this object. Because the number of records varies a lot across countries, we have to use uneven color bins to visualize the values. We add a slider to the graph to select the hour of the data, and we animate the graph with a "play/pause" button. We use the addPolygons() function to draw the SpatialPolygonsDataFrame on the map. We set many parameters in this function to make the graph look better, such as making the polygons transparent so that the country names in the background remain readable, drawing the country boundaries as white dashed lines for a hand-drawn look, and showing a description of each country when the cursor is over its polygon.
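
The effect of uneven color bins can be illustrated without drawing a map. The counts and break points below are hypothetical; base R's `cut()` is used here to show the binning itself, and leaflet's `colorBin()` accepts a comparable vector of cut points through its `bins` argument.

```r
# Hypothetical hourly record counts spanning several orders of magnitude.
counts <- c(Cyprus = 21000, Turkey = 480000, Oman = 950, Yemen = 40)

# Uneven (roughly logarithmic) break points; these values are illustrative.
bins <- c(0, 100, 1000, 10000, 100000, Inf)

# With uneven bins, each country lands in a different class.
uneven <- cut(counts, breaks = bins, right = FALSE, dig.lab = 6)

# With five equal-width bins, all but the largest country share one class.
equal <- cut(counts, breaks = 5)

table(uneven)
table(equal)
```

Equal-width classes are dominated by the largest value, which is why a skewed variable such as the number of records needs hand-picked or logarithmic breaks to show any contrast between the smaller countries.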

This is a screenshot of the interactive bar chart (in the [shiny app](https://zhirui.shinyapps.io/mobile_signal/), it is the **Number of Records in Mideast** barchart tab):

![](./GIF/number mideast.gif)


This graph is very similar to the static bar chart of the number of records, except that we add the hour as another dimension to animate it. We add a slider to select the hour, use the filter() function to select the data for that hour, and then render the plot.
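
The data preparation behind an hour slider can be sketched as follows. The records and the helper `counts_for_hour()` are hypothetical, and pre-aggregating the counts is just one possible arrangement, not necessarily how the app is organized.

```r
# Toy records (hypothetical values).
rec <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-11 03:10:00", "2016-12-11 03:45:00",
                           "2016-12-11 21:05:00", "2016-12-11 21:30:00",
                           "2016-12-11 21:59:00"), tz = "UTC"),
  COUNTRY = c("Cyprus", "Turkey", "Cyprus", "Cyprus", "Turkey"),
  stringsAsFactors = FALSE
)
rec$Hour <- as.integer(format(rec$Timestamp, "%H"))

# Pre-compute one count per country and hour, so that each slider step
# only needs a cheap subset instead of re-aggregating the raw records.
hourly <- as.data.frame(table(COUNTRY = rec$COUNTRY, Hour = rec$Hour),
                        responseName = "count", stringsAsFactors = FALSE)
hourly$Hour <- as.integer(hourly$Hour)

# The rows that the bar chart for a selected hour would receive.
counts_for_hour <- function(h) {
  hourly[hourly$Hour == h, c("COUNTRY", "count")]
}

selected <- counts_for_hour(21)
```

In a Shiny app, the value of the slider input would take the place of the literal `21`.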

From the plot above, we can see that the number of records bottoms out at dawn, starts to increase as time goes by, and reaches its peak around midnight, which makes sense. At dawn, almost everyone is asleep and few people are awake. When people get up and begin the day, they start to use their phones for all kinds of things. During the daytime, however, people have to work, study or run errands, so the peak in mobile usage occurs only after they have finished work or schoolwork, had dinner with their families and put their children to bed.

###Accuracy
Besides the number of records, the other interesting feature in this data set is the accuracy. It can be seen as the 'confidence interval' of the location reported by the base stations in different countries, and it can also be affected by the mobile device itself.
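```{r,fig.width=8}
# Keep only records after midnight (UTC) on 2016-12-12. Compare against a
# POSIXct value, because comparing a timestamp column with a Date is unreliable.
x_group_accuracy <- x %>% 
  filter(Timestamp > as.POSIXct('2016-12-12 00:00:00', tz = 'UTC')) %>% 
  group_by(COUNTRY) %>% 
  # Mean and standard deviation of accuracy per country
  summarise(mean_Accuracy=mean(Accuracy),st_accuracy=sd(Accuracy))
# Horizontal bar chart sorted by mean accuracy, made interactive with plotly
(x_group_accuracy %>% 
  arrange(mean_Accuracy) %>% 
  ggplot(aes(y=mean_Accuracy,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Mean Accuracy')+
  xlab('Country Name')+
  ggtitle('Mean Accuracy by Country')) %>% 
  ggplotly
```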

In order to draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by country and calculate the mean and standard deviation of accuracy for each country. We also sort the bars by mean accuracy.

From the plot above, four countries have a mean accuracy over 2000 meters: Libya, Iraq, South Sudan and the Democratic Republic of Congo. All of them are in upheaval or have experienced huge turbulence. The smoke from the wars might affect the mobile signal and hence the accuracy. Additionally, in unstable countries like these, there is no extra money, people, resources or motivation to develop the mobile industry, both mobile phones and base stations. So they are expected to have the highest accuracy values, that is, the least precise locations. On the other hand, Cyprus again beats the other countries and has the best accuracy in the Middle East. Besides its stable political situation, much of the country is flat terrain, which is beneficial for base stations. Furthermore, as a popular tourist destination, it has the motivation to build a better environment for internet users in order to attract more tourists.

We can also visualize how the mean accuracy changes over time.

This is a screenshot of the interactive bar chart (in the [shiny app](https://zhirui.shinyapps.io/mobile_signal/), it is the **Mean Accuracy in Mideast** bar chart tab):

![](./GIF/mean accuracy mideast.gif)

This graph is very similar to the static bar chart of the mean accuracy, except that we add hour as another dimension to animate it. We also add a slider to select the hour, use the filter() function to keep the data for that hour, and then render the plot.

There does not seem to be a clear pattern in how the mean accuracy evolves over time, but we can see that South Sudan and Libya vary a lot across time. So we plot the standard deviation of the accuracy across countries and across time to get a clearer view.

```{r,fig.width=8}
(x_group_accuracy %>% 
  arrange(st_accuracy) %>% 
  ggplot(aes(y=st_accuracy,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Standard Deviation of Accuracy')+
  xlab('Country Name')+
  ggtitle('Standard Deviation of Accuracy by Country')) %>% 
  ggplotly
```

We use the data frame generated in the previous part as input for this graph. We order the countries by the standard deviation of accuracy and draw the bar chart.

South Sudan has the highest standard deviation, followed by Libya and Syria, which probably has much to do with their unstable political situations. However, the rest of the countries have similarly high standard deviations. So it is possible that Middle East countries generally do not have sophisticated techniques and well-developed infrastructure, which leads to poor accuracy with a large standard deviation.

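```{r,fig.width=8}
# Keep only records after midnight (UTC) on 2016-12-12, using a POSIXct
# cutoff so that the comparison with the timestamp column is well defined.
(x %>% 
  filter(Timestamp > as.POSIXct('2016-12-12 00:00:00', tz = 'UTC')) %>% 
  mutate(Hour=hour(Timestamp)) %>% 
  group_by(Hour) %>% 
  # Standard deviation of accuracy for each hour of the day
  summarise(st_accuracy=sd(Accuracy)) %>% 
  # as_factor() on the character hours keeps them in numeric order
  mutate(hour=Hour %>% as.character) %>% 
  ggplot(aes(y=st_accuracy,x=hour %>% as_factor))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Standard Deviation of Accuracy')+
  xlab('Hour')+
  ggtitle('Standard Deviation of Accuracy across Time')) %>% 
  ggplotly
```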
In order to draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by hour and calculate the standard deviation of accuracy for each hour. We also have to use as_factor to maintain the numerical order of the hours; otherwise, once the hour is converted to character, ggplot arranges 10, 11 and 12 right after 1 rather than after 9.
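
The ordering issue can be reproduced on a toy vector. The sketch below compares base factor(), which sorts the levels alphabetically, with forcats::as_factor(), which keeps the order of first appearance.

```r
library(forcats)

# Hours as character strings, already in numeric order (as after group_by(Hour)).
hours <- as.character(0:23)

# Base factor() sorts levels alphabetically, so "10" comes right after "1".
alphabetical <- levels(factor(hours))
alphabetical[1:4]   # "0" "1" "10" "11"

# as_factor() keeps the order of first appearance, which is the numeric order here.
by_appearance <- levels(as_factor(hours))
by_appearance[1:4]  # "0" "1" "2" "3"
```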

From the earlier analysis of the number of records across time, we naturally assumed that accuracy would follow a similar pattern. However, based on the plot above, there is no notable difference in the standard deviation of accuracy across time. This indicates that accuracy mainly depends on the base stations themselves rather than on how many people are using their mobiles.

##Russia
###Number of Records and Accuracy
First we draw a bar chart of the number of records per hour.
```{r, message=FALSE, warning=FALSE}
# Read the three daily Russia files and stack them into one data frame.
X2016_12_10b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-10b_new.csv",progress=FALSE)
X2016_12_11b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-11b_new.csv",progress=FALSE)
X2016_12_12b_new <- read_csv("C:/Users/wang_/Desktop/2016-12-12b_new.csv",progress=FALSE)
x <- bind_rows(X2016_12_10b_new,X2016_12_11b_new,X2016_12_12b_new)
```

```{r,fig.width=8}
(x %>% 
  mutate(Hour=hour(`Time Stamp`)) %>% 
  group_by(Hour) %>% 
  summarise(count=n()) %>% 
  ggplot(aes(y=count,x=Hour %>% as.character))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records')+
  xlab('Hour')+
  ggtitle('Number of Records across Time')) %>% 
  ggplotly
```
The plotting process is almost the same as for its Mideast counterpart. The only difference is that no records fall outside the days covered, so we do not need to clean the data first.

The internet usage behavior in Russia is very different from that in the Mideast. Within the hours covered by the data (13 to 22), the number of records peaks in the afternoon, decreases as time goes by and bottoms out at night, which is the opposite of the Mideast. It might be that Russians tend to start their day early and finish their jobs around noon. Then, after an early lunch, they like to have afternoon tea with colleagues or friends, where they chat, laugh, take selfies and post on social media like Facebook or Instagram. Also, Russians are rather traditional, so after work or school they tend to go home and spend quality time with their families, like watching a movie together, reading books to their children, or even taking a dance lesson with their spouse.

We can also see this relationship when we plot the data on the map.
This is a screenshot of the animated map (in the [shiny app](https://zhirui.shinyapps.io/mobile_signal/), it is the **Accuracy in Russia** map tab):

![](./GIF/accuracy russia.gif)

In this graph we have billions of points to visualize. When we try to visualize them with interactive graphics, R crashes halfway, so we have to use a static plotting method to draw the map and combine the maps over time into an animation. Here we use the ggmap package to draw the static map. The range of Accuracy is very wide, so we apply a log transformation to Accuracy to get clearer color contrast. It takes several minutes to draw the map for one of the ten hours, so we use parallel computing in R to draw the maps in parallel and speed up the process. We save each map as a PNG and load each hour's map in shiny using an hour slider.
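
The parallel rendering step could look roughly like the sketch below, which uses the base `parallel` package. It is an assumption-laden illustration, not the code used for the app: `render_hour()` is a hypothetical helper, a placeholder scatter plot stands in for the ggmap drawing code, and the files are written to a temporary directory.

```r
library(parallel)

# Hypothetical helper: draw the map for one hour and save it as a PNG.
# A placeholder plot stands in for the real ggmap drawing code.
render_hour <- function(h, out_dir) {
  path <- file.path(out_dir, sprintf("accuracy_hour_%02d.png", h))
  grDevices::png(path, width = 800, height = 600)
  graphics::plot(1:10, log(1:10), main = paste("Hour", h))
  grDevices::dev.off()
  path
}

out_dir <- tempdir()
hours <- 13:22                 # the ten hours covered by the Russia data

cl <- makeCluster(2)           # two worker processes; adjust to the machine
paths <- unlist(parLapply(cl, hours, render_hour, out_dir = out_dir))
stopCluster(cl)

paths                          # one PNG path per hour
```

A socket cluster created with makeCluster() is used here because it also works on Windows, where fork-based helpers such as mclapply() cannot run tasks in parallel.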

As shown in the map, the mobile phone users are clustered in the southwestern part of the country, and there is not a single request in most of Siberia. This is what we expected, since the majority of Russians live in the southwestern region, while Siberia is sparsely populated because of its harsh climate.

We can also see the pattern discussed earlier clearly in the map: the number of dots peaks in the afternoon and decreases as time goes by.

In addition, we can see from the map that in the southwestern part of the country there are many red dots, while in the northeastern part most of the dots are light yellow. To quantify this observation, we fit a linear regression of accuracy on latitude and longitude.
$$
  Accuracy = \beta_{0} + \beta_{1}Latitude + \beta_{2}Longitude + \beta_{3}Latitude\times Longitude + \epsilon 
$$

```{r, message=FALSE, warning=FALSE}
reg <- biglm(Accuracy~Lattitude+Longitude+Lattitude*Longitude,data=x)
reg %>% coeftest %>% as.table %>% kable
```

In this linear regression we use biglm() instead of lm() because we have billions of data points; the regular lm() function would generate a huge object of up to 9 gigabytes and use up all the memory. biglm() overcomes this problem by updating the fit incrementally as it processes the data, rather than storing a full model object.
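
To illustrate the incremental idea, the sketch below fits the same model formula on simulated data in three chunks with biglm() and update(), then compares the result with lm(). The data and the coefficient values used to simulate them are made up; only the column names mirror the ones in the regression above. The report itself calls biglm() once on the full data, so the chunking here is an assumption for illustration.

```r
library(biglm)

# Simulated toy data (made-up values, not the real records).
set.seed(1)
n <- 3000
toy <- data.frame(
  Lattitude = runif(n, 41, 81),
  Longitude = runif(n, 20, 180)
)
toy$Accuracy <- 100 - 0.5 * toy$Lattitude - 0.2 * toy$Longitude +
  rnorm(n, sd = 5)

# Split the data into three chunks of equal size.
chunks <- split(toy, rep(1:3, each = n / 3))
form <- Accuracy ~ Lattitude + Longitude + Lattitude:Longitude

# Fit on the first chunk, then fold in the remaining chunks one at a time.
fit <- biglm(form, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

# The chunked fit should agree with an ordinary lm() fit on all rows.
full <- lm(form, data = toy)
cbind(biglm = coef(fit), lm = coef(full))
```

With real data, each chunk could be read from disk in turn, so that only one chunk needs to be in memory at a time.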

The regression coefficients agree with what we saw on the map. As latitude or longitude increases by 1 degree, the accuracy decreases by 0.36 and 0.32 respectively, and both effects are statistically significant. The interaction term is significant as well, but its magnitude is so small that we can ignore it. We know that latitude varies by about 40 degrees and longitude by about 140 degrees across Russia, which can bring a substantial difference in accuracy across the board.

###Internet Request Motion in Moscow
In this part, we want to visualize the movement of the people who make mobile internet requests in Moscow. We select the user IDs that have been to five famous landmarks in Moscow, who can be regarded as tourists, and visualize their mobile internet requests across the hours of one day. Our data only contains hours 13-22, so we can only visualize these ten hours. Note that there may be hundreds of internet requests while the number of unique IDs is around 20; this number is shown in the upper-right panel. The hour and landmark selectors are also in this panel. Clicking the icon of a data point shows its description, namely the User ID and Time Stamp.
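
One way to select such user IDs is sketched below with dplyr. Everything in it is an assumption made for illustration: the column names, the toy coordinates and the bounding box around a landmark are invented, and the report does not state how "has been to a landmark" was defined.

```r
library(dplyr)

# Toy requests with invented IDs and coordinates (not real data).
toy_requests <- tibble(
  user_id = c("u1", "u1", "u2", "u3", "u3"),
  hour    = c(13, 15, 14, 13, 16),
  lat     = c(55.7525, 55.9726, 55.6000, 55.7601, 55.5915),
  lon     = c(37.6231, 37.4146, 37.7000, 37.6186, 37.2615)
)

# Hypothetical bounding box around one landmark.
landmark_box <- list(lat_min = 55.750, lat_max = 55.755,
                     lon_min = 37.620, lon_max = 37.626)

# IDs with at least one request inside the box.
visitor_ids <- toy_requests %>%
  filter(between(lat, landmark_box$lat_min, landmark_box$lat_max),
         between(lon, landmark_box$lon_min, landmark_box$lon_max)) %>%
  distinct(user_id)

# All requests made by those IDs, wherever they were made.
visitor_requests <- semi_join(toy_requests, visitor_ids, by = "user_id")
visitor_requests
```

In this toy data only `u1` has a request inside the box, so both of that user's requests are kept, including the one made elsewhere.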

This is a screenshot of the interactive map (in the [shiny app](https://zhirui.shinyapps.io/mobile_signal/), it is the **Internet Request in Moscow** map tab):

![](./GIF/motion russia.gif)

To draw this map and read it clearly, we need a basemap that provides English place names for Russian territory. Unfortunately, none of the default map providers in leaflet shows English place names for foreign countries, so we use Mapbox as a custom map provider. We want to see how clustered the internet requests are, so we use marker clusters when drawing individual points on the map; each cluster shows the number of points it contains. When we zoom in, the clusters spread out and finally show individual points.
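
A minimal sketch of marker clustering in leaflet follows. The points and labels are invented, and the default OpenStreetMap tiles from addTiles() stand in for the Mapbox basemap, whose configuration is not shown in this report.

```r
library(leaflet)

# Invented points and labels for illustration (not real requests).
toy_points <- data.frame(
  lat   = c(55.751, 55.752, 55.753, 55.760),
  lon   = c(37.618, 37.619, 37.620, 37.640),
  label = c("User A, 13:05", "User A, 13:40", "User B, 14:10", "User C, 15:20")
)

m <- leaflet(toy_points) %>%
  addTiles() %>%  # default OpenStreetMap tiles, standing in for Mapbox
  addMarkers(
    lng = ~lon, lat = ~lat,
    popup = ~label,                           # shown when a marker is clicked
    clusterOptions = markerClusterOptions()   # group nearby markers with a count
  )

m  # print the map; zooming in splits clusters into single markers
```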

First, let us look at the general pattern of internet requests from the people who have been to these five landmarks. The number of unique IDs and the number of requests peak in the afternoon and bottom out at night, with no records in hour 22. This might be because tourists go back to their hotels at night and connect to the hotel WiFi, so they do not need to make internet requests via the telecom network.

The next thing worth noticing is that two Moscow airports appear on the map: Vnukovo International Airport on the bottom-left, and Sheremetyevo International Airport on the upper-left. Almost all the internet requests at the airports occur during hours 13 to 17. In each of these hours the IDs seen at an airport are different, which suggests that the airports receive new tourists every hour. There are no records at the airports in the evening, perhaps because people who arrive in Moscow at night do not go to the landmarks and instead go straight to their hotels.

Most of the internet records are clustered in the center of Moscow, in the area around the Kremlin and Red Square. This suggests that no matter which other places people visit in Moscow, the landmarks in central Moscow are must-visit places for most tourists.

We also notice that many of the clusters of internet requests are on a bridge or at the waterfront. This may be because bridges and waterfronts are great places for photography: people take photos there with their mobile phones and then upload them to social media, which requires internet requests.

Now, we take a look at the airports from which our tourists come. It is a little surprising that the people visiting St. Basil’s Cathedral all come from Vnukovo International Airport, since Sheremetyevo International Airport handles twice as many passengers. A similar situation occurs for the tourists who visited GUM Department Store (aka Glavny Universalny Magazin), with only 7 records at Sheremetyevo International Airport. Another notable finding is that there is no internet usage record at any airport from the people who visited Patriarch’s Pond. It is possible that they are all local citizens who just go to the pond for a walk rather than tourists. After all, it is neither a piece of historical architecture that cannot be found anywhere other than Moscow nor a must-see in Russia. As for the Bolshoi Theatre and the Moscow Metro, there is no noticeable difference between the numbers of records at the two airports.


#6 Conclusion
There are some limitations in this project. First, the data is the mobile internet request data from one carrier in the Mideast and Russia. We do not know whether the users of this carrier represent all the people in each country, as there might be systematic bias in the sample; for example, poorer people may tend to use a cheaper carrier. We would need more information about this carrier and the characteristics of its users to discuss this further.

Second, we use the number of request records in our analysis, which might obscure some important information in the data set. In the analysis of population and GDP of the Mideast countries, we mentioned some alternative approaches that use unique IDs instead. For the internet request motion in Moscow, we could also use the unique IDs to visualize the movement of people rather than of internet requests. However, a person may make many requests in many places during one hour, so it is hard to determine which point to choose. Besides, the data is only available when people make internet requests, so it is hard to track a person's location when he or she does not make one. Still, these alternatives are possible directions for future work.

Third, there are a few bugs in the shiny apps. When we play the animated interactive map, the background of the map keeps flashing because shiny generates a new map under the hood at each hour step. We tried the Leaflet.timeline JavaScript plugin for leaflet, but it turned out to have even more bugs than shiny, so we finally chose to use shiny's own slider. Also, the sliders of the two maps are hard to drag, and how hard they are to drag varies between group members' computers. We think this might be a bug in shiny itself.

As for the lessons learned, we could have done better in exploring the deeper meaning or patterns of the data. Instead, we spent many hours trying to fix these bugs in shiny and JavaScript, or looking for a more beautiful way to visualize our maps using shiny. This was really time-consuming, had the least marginal benefit, and kept us from analyzing the data more deeply and discovering possible reasons or stories behind these findings.